Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [http://machinelearningmastery.com/]
SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Online News Popularity dataset presents a binary classification problem in which we try to predict one of two possible outcomes.
INTRODUCTION: This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal is to predict the article’s popularity level in social networks. The dataset does not contain the original content but some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.
Many thanks to K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal, for making the dataset and benchmarking information available.
In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the algorithm that produces the best accuracy result. Iteration Take1 established a baseline performance regarding accuracy and processing time.
For this iteration, we will examine the feasibility of a dimensionality reduction technique that ranks attribute importance using a gradient boosting tree method. Afterward, we will eliminate the features that fall outside the top 0.99 (or 99%) of cumulative importance.
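The cumulative-importance cutoff works by sorting the attributes from most to least important and keeping only those needed to reach 99% of the total importance. A minimal sketch on a made-up importance vector (the names and scores here are toy values, not this project's attributes):

```r
# Toy sketch: keep attributes until their cumulative share of total
# importance reaches 0.99, and mark the rest for elimination.
importance <- c(a = 50, b = 30, c = 15, d = 4, e = 0.9, f = 0.1)
ranked <- sort(importance, decreasing = TRUE)
cumShare <- cumsum(ranked) / sum(ranked)
# First position at which the cumulative share reaches the threshold
cutoff <- which(cumShare >= 0.99)[1]
keep <- names(ranked)[1:cutoff]
drop <- setdiff(names(ranked), keep)
cat("Keep:", keep, "\nDrop:", drop, "\n")
```

The actual script performs the same accumulation with a while loop over the GBM-ranked importance table later in this report.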
ANALYSIS: From the previous iteration Take1, the baseline performance of the machine learning algorithms achieved an average accuracy of 64.53%. Three algorithms (Random Forest, AdaBoost, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 67.48%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 66.71%, which was just slightly below the training data.
In the current iteration, the baseline performance of the machine learning algorithms achieved an average accuracy of 64.29%. Two ensemble algorithms (Random Forest and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved an average accuracy of 67.51%. Using the optimized tuning parameter available, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 66.53%, which was just slightly below the accuracy of the training data.
From the model-building activities, the number of attributes went from 58 down to 42 after eliminating 16 attributes. The processing time went from 6 hours 31 minutes in iteration Take1 down to 3 hours 18 minutes in iteration Take2, a reduction of 49%.
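As a quick arithmetic check on the reported timing (the minute totals are computed from the figures above):

```r
# Verify the reported 49% processing-time reduction
take1 <- 6 * 60 + 31   # Take1: 6 h 31 min = 391 minutes
take2 <- 3 * 60 + 18   # Take2: 3 h 18 min = 198 minutes
reduction <- (take1 - take2) / take1
cat(sprintf("Reduction: %.0f%%\n", reduction * 100))
```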
CONCLUSION: The feature selection technique helped by reducing the number of attributes and, with it, the training time. The modeling took much less time to process yet retained a comparable level of accuracy. For this dataset, the Stochastic Gradient Boosting algorithm and the attribute importance ranking technique should be considered for further modeling or production use.
Dataset Used: Online News Popularity Dataset
Dataset ML Model: Binary classification with numerical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
The project aims to touch on the following areas:
Any predictive modeling machine learning project can generally be broken down into about six major tasks:
startTimeScript <- proc.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
## corrplot 0.84 loaded
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(parallel)
library(mailR)
# Create one random seed number for reproducible results
seedNum <- 888
set.seed(seedNum)
originalDataset <- read.csv("OnlineNewsPopularity.csv", header= TRUE)
# Using the "shares" column to set up the target variable column
# targetVar <- 0 when shares < 1400, targetVar <- 1 when shares >= 1400
originalDataset$targetVar <- 0
originalDataset$targetVar[originalDataset$shares>=1400] <- 1
originalDataset$targetVar <- as.factor(originalDataset$targetVar)
originalDataset$shares <- NULL
# Dropping the two non-predictive attributes: url and timedelta
originalDataset$url <- NULL
originalDataset$timedelta <- NULL
# Different ways of reading and processing the input dataset. Saving these for future reference.
#x_train <- read.fwf("X_train.txt", widths = widthVector, col.names = colNames)
#y_train <- read.csv("y_train.txt", header = FALSE, col.names = c("targetVar"))
#y_train$targetVar <- as.factor(y_train$targetVar)
#xy_train <- cbind(x_train, y_train)
# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(originalDataset)
# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# if (targetCol != 1) and (targetCol != totCol), be aware when slicing up the dataframes for visualization!
targetCol <- totCol
#colnames(originalDataset)[targetCol] <- "targetVar"
# We create training datasets (xy_train, x_train, y_train) for various operations.
# We create validation datasets (xy_test, x_test, y_test) for various operations.
set.seed(seedNum)
# Create a list of the rows in the original dataset we can use for training
training_index <- createDataPartition(originalDataset$targetVar, p=0.70, list=FALSE)
# Use 70% of the data to train the models and the remaining for testing/validation
xy_train <- originalDataset[training_index,]
xy_test <- originalDataset[-training_index,]
if (targetCol==1) {
x_train <- xy_train[,(targetCol+1):totCol]
y_train <- xy_train[,targetCol]
y_test <- xy_test[,targetCol]
} else {
x_train <- xy_train[,1:(totAttr)]
y_train <- xy_train[,totCol]
y_test <- xy_test[,totCol]
}
# Set up the number of row and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 8
if (totAttr%%dispCol == 0) {
dispRow <- totAttr%/%dispCol
} else {
dispRow <- (totAttr%/%dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row): 8 by 8
# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "Accuracy"
email_notify <- function(msg=""){
sender <- "luozhi2488@gmail.com"
receiver <- "dave@contactdavidlowe.com"
sbj_line <- "Notification from R Script"
password <- readLines("email_credential.txt")
send.mail(
from = sender,
to = receiver,
subject= sbj_line,
body = msg,
smtp = list(host.name = "smtp.gmail.com", port = 465, user.name = sender, passwd = password, ssl = TRUE),
authenticate = TRUE,
send = TRUE)
}
email_notify(paste("Library and Data Loading Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2aafb23c}"
To gain a better understanding of the data on hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.
head(xy_train)
## n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words
## 2 9 255 0.6047431 1
## 3 9 211 0.5751295 1
## 4 9 531 0.5037879 1
## 7 8 960 0.4181626 1
## 10 10 231 0.6363636 1
## 11 9 1248 0.4900498 1
## n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs num_videos
## 2 0.7919463 3 1 1 0
## 3 0.6638655 3 1 1 0
## 4 0.6656347 9 0 1 0
## 7 0.5498339 21 20 20 0
## 10 0.7971014 4 1 1 1
## 11 0.7316384 11 0 1 0
## average_token_length num_keywords data_channel_is_lifestyle
## 2 4.913725 4 0
## 3 4.393365 6 0
## 4 4.404896 7 0
## 7 4.654167 10 1
## 10 5.090909 5 0
## 11 4.617788 8 0
## data_channel_is_entertainment data_channel_is_bus
## 2 0 1
## 3 0 1
## 4 1 0
## 7 0 0
## 10 0 0
## 11 0 0
## data_channel_is_socmed data_channel_is_tech data_channel_is_world
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 7 0 0 0
## 10 0 0 1
## 11 0 0 1
## kw_min_min kw_max_min kw_avg_min kw_min_max kw_max_max kw_avg_max
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 7 0 0 0 0 0 0
## 10 0 0 0 0 0 0
## 11 0 0 0 0 0 0
## kw_min_avg kw_max_avg kw_avg_avg self_reference_min_shares
## 2 0 0 0 0
## 3 0 0 0 918
## 4 0 0 0 0
## 7 0 0 0 545
## 10 0 0 0 0
## 11 0 0 0 0
## self_reference_max_shares self_reference_avg_sharess weekday_is_monday
## 2 0 0.000 1
## 3 918 918.000 1
## 4 0 0.000 1
## 7 16000 3151.158 1
## 10 0 0.000 1
## 11 0 0.000 1
## weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 7 0 0 0
## 10 0 0 0
## 11 0 0 0
## weekday_is_friday weekday_is_saturday weekday_is_sunday is_weekend
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 7 0 0 0 0
## 10 0 0 0 0
## 11 0 0 0 0
## LDA_00 LDA_01 LDA_02 LDA_03 LDA_04
## 2 0.79975569 0.05004668 0.05009625 0.05010067 0.05000071
## 3 0.21779229 0.03333446 0.03335142 0.03333354 0.68218829
## 4 0.02857322 0.41929964 0.49465083 0.02890472 0.02857160
## 7 0.02008167 0.11470539 0.02002437 0.02001533 0.82517325
## 10 0.04000010 0.04000003 0.83999721 0.04000063 0.04000204
## 11 0.02500356 0.28730114 0.40082932 0.26186375 0.02500223
## global_subjectivity global_sentiment_polarity
## 2 0.3412458 0.14894781
## 3 0.7022222 0.32333333
## 4 0.4298497 0.10070467
## 7 0.5144803 0.26830272
## 10 0.3138889 0.05185185
## 11 0.4820598 0.10235015
## global_rate_positive_words global_rate_negative_words
## 2 0.04313725 0.015686275
## 3 0.05687204 0.009478673
## 4 0.04143126 0.020715631
## 7 0.08020833 0.016666667
## 10 0.03896104 0.030303030
## 11 0.03846154 0.020833333
## rate_positive_words rate_negative_words avg_positive_polarity
## 2 0.7333333 0.2666667 0.2869146
## 3 0.8571429 0.1428571 0.4958333
## 4 0.6666667 0.3333333 0.3859652
## 7 0.8279570 0.1720430 0.4020386
## 10 0.5625000 0.4375000 0.2984127
## 11 0.6486486 0.3513514 0.4044801
## min_positive_polarity max_positive_polarity avg_negative_polarity
## 2 0.03333333 0.7 -0.1187500
## 3 0.10000000 1.0 -0.4666667
## 4 0.13636364 0.8 -0.3696970
## 7 0.10000000 1.0 -0.2244792
## 10 0.10000000 0.5 -0.2380952
## 11 0.10000000 1.0 -0.4150641
## min_negative_polarity max_negative_polarity title_subjectivity
## 2 -0.125 -0.1000000 0
## 3 -0.800 -0.1333333 0
## 4 -0.600 -0.1666667 0
## 7 -0.500 -0.0500000 0
## 10 -0.500 -0.1000000 0
## 11 -1.000 -0.1000000 0
## title_sentiment_polarity abs_title_subjectivity
## 2 0 0.5
## 3 0 0.5
## 4 0 0.5
## 7 0 0.5
## 10 0 0.5
## 11 0 0.5
## abs_title_sentiment_polarity targetVar
## 2 0 0
## 3 0 1
## 4 0 0
## 7 0 0
## 10 0 0
## 11 0 1
dim(xy_train)
## [1] 27751 59
sapply(xy_train, class)
## n_tokens_title n_tokens_content
## "numeric" "numeric"
## n_unique_tokens n_non_stop_words
## "numeric" "numeric"
## n_non_stop_unique_tokens num_hrefs
## "numeric" "numeric"
## num_self_hrefs num_imgs
## "numeric" "numeric"
## num_videos average_token_length
## "numeric" "numeric"
## num_keywords data_channel_is_lifestyle
## "numeric" "numeric"
## data_channel_is_entertainment data_channel_is_bus
## "numeric" "numeric"
## data_channel_is_socmed data_channel_is_tech
## "numeric" "numeric"
## data_channel_is_world kw_min_min
## "numeric" "numeric"
## kw_max_min kw_avg_min
## "numeric" "numeric"
## kw_min_max kw_max_max
## "numeric" "numeric"
## kw_avg_max kw_min_avg
## "numeric" "numeric"
## kw_max_avg kw_avg_avg
## "numeric" "numeric"
## self_reference_min_shares self_reference_max_shares
## "numeric" "numeric"
## self_reference_avg_sharess weekday_is_monday
## "numeric" "numeric"
## weekday_is_tuesday weekday_is_wednesday
## "numeric" "numeric"
## weekday_is_thursday weekday_is_friday
## "numeric" "numeric"
## weekday_is_saturday weekday_is_sunday
## "numeric" "numeric"
## is_weekend LDA_00
## "numeric" "numeric"
## LDA_01 LDA_02
## "numeric" "numeric"
## LDA_03 LDA_04
## "numeric" "numeric"
## global_subjectivity global_sentiment_polarity
## "numeric" "numeric"
## global_rate_positive_words global_rate_negative_words
## "numeric" "numeric"
## rate_positive_words rate_negative_words
## "numeric" "numeric"
## avg_positive_polarity min_positive_polarity
## "numeric" "numeric"
## max_positive_polarity avg_negative_polarity
## "numeric" "numeric"
## min_negative_polarity max_negative_polarity
## "numeric" "numeric"
## title_subjectivity title_sentiment_polarity
## "numeric" "numeric"
## abs_title_subjectivity abs_title_sentiment_polarity
## "numeric" "numeric"
## targetVar
## "factor"
summary(xy_train)
## n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words
## Min. : 3.0 Min. : 0.0 Min. :0.0000 Min. :0.0000
## 1st Qu.: 9.0 1st Qu.: 247.0 1st Qu.:0.4703 1st Qu.:1.0000
## Median :10.0 Median : 411.0 Median :0.5389 Median :1.0000
## Mean :10.4 Mean : 549.5 Mean :0.5301 Mean :0.9704
## 3rd Qu.:12.0 3rd Qu.: 720.0 3rd Qu.:0.6078 3rd Qu.:1.0000
## Max. :23.0 Max. :8474.0 Max. :1.0000 Max. :1.0000
## n_non_stop_unique_tokens num_hrefs num_self_hrefs
## Min. :0.0000 Min. : 0.00 Min. : 0.000
## 1st Qu.:0.6254 1st Qu.: 4.00 1st Qu.: 1.000
## Median :0.6905 Median : 8.00 Median : 2.000
## Mean :0.6726 Mean : 10.94 Mean : 3.296
## 3rd Qu.:0.7544 3rd Qu.: 14.00 3rd Qu.: 4.000
## Max. :1.0000 Max. :304.00 Max. :116.000
## num_imgs num_videos average_token_length num_keywords
## Min. : 0.000 Min. : 0.000 Min. :0.000 Min. : 1.000
## 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.:4.479 1st Qu.: 6.000
## Median : 1.000 Median : 0.000 Median :4.666 Median : 7.000
## Mean : 4.557 Mean : 1.262 Mean :4.550 Mean : 7.214
## 3rd Qu.: 4.000 3rd Qu.: 1.000 3rd Qu.:4.855 3rd Qu.: 9.000
## Max. :128.000 Max. :91.000 Max. :7.696 Max. :10.000
## data_channel_is_lifestyle data_channel_is_entertainment
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000
## Mean :0.05236 Mean :0.1786
## 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000
## data_channel_is_bus data_channel_is_socmed data_channel_is_tech
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1591 Mean :0.0569 Mean :0.1865
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## data_channel_is_world kw_min_min kw_max_min kw_avg_min
## Min. :0.0000 Min. : -1.00 Min. : 0 Min. : -1.0
## 1st Qu.:0.0000 1st Qu.: -1.00 1st Qu.: 445 1st Qu.: 140.6
## Median :0.0000 Median : -1.00 Median : 660 Median : 234.5
## Mean :0.2131 Mean : 25.78 Mean : 1161 Mean : 312.6
## 3rd Qu.:0.0000 3rd Qu.: 4.00 3rd Qu.: 1000 3rd Qu.: 355.8
## Max. :1.0000 Max. :318.00 Max. :298400 Max. :42827.9
## kw_min_max kw_max_max kw_avg_max kw_min_avg
## Min. : 0 Min. : 0 Min. : 0 Min. : -1
## 1st Qu.: 0 1st Qu.:843300 1st Qu.:173445 1st Qu.: 0
## Median : 1500 Median :843300 Median :245217 Median :1039
## Mean : 13848 Mean :753637 Mean :260169 Mean :1124
## 3rd Qu.: 7900 3rd Qu.:843300 3rd Qu.:331582 3rd Qu.:2063
## Max. :843300 Max. :843300 Max. :843300 Max. :3610
## kw_max_avg kw_avg_avg self_reference_min_shares
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 3557 1st Qu.: 2387 1st Qu.: 640
## Median : 4353 Median : 2872 Median : 1200
## Mean : 5664 Mean : 3139 Mean : 4012
## 3rd Qu.: 6020 3rd Qu.: 3600 3rd Qu.: 2600
## Max. :298400 Max. :43568 Max. :843300
## self_reference_max_shares self_reference_avg_sharess weekday_is_monday
## Min. : 0 Min. : 0 Min. :0.0000
## 1st Qu.: 1100 1st Qu.: 986 1st Qu.:0.0000
## Median : 2800 Median : 2200 Median :0.0000
## Mean : 10439 Mean : 6434 Mean :0.1673
## 3rd Qu.: 8000 3rd Qu.: 5200 3rd Qu.:0.0000
## Max. :843300 Max. :843300 Max. :1.0000
## weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
## Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.0000 Median :0.000 Median :0.0000
## Mean :0.1867 Mean :0.188 Mean :0.1822
## 3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000
## weekday_is_friday weekday_is_saturday weekday_is_sunday is_weekend
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.1446 Mean :0.0623 Mean :0.06886 Mean :0.1312
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## LDA_00 LDA_01 LDA_02 LDA_03
## Min. :0.01818 Min. :0.01818 Min. :0.01818 Min. :0.01818
## 1st Qu.:0.02505 1st Qu.:0.02501 1st Qu.:0.02857 1st Qu.:0.02857
## Median :0.03339 Median :0.03335 Median :0.04000 Median :0.04000
## Mean :0.18486 Mean :0.14082 Mean :0.21579 Mean :0.22336
## 3rd Qu.:0.24068 3rd Qu.:0.15029 3rd Qu.:0.33307 3rd Qu.:0.37331
## Max. :0.92699 Max. :0.92595 Max. :0.92000 Max. :0.92653
## LDA_04 global_subjectivity global_sentiment_polarity
## Min. :0.01818 Min. :0.0000 Min. :-0.38021
## 1st Qu.:0.02857 1st Qu.:0.3964 1st Qu.: 0.05823
## Median :0.04124 Median :0.4540 Median : 0.11958
## Mean :0.23517 Mean :0.4435 Mean : 0.11970
## 3rd Qu.:0.40332 3rd Qu.:0.5083 3rd Qu.: 0.17795
## Max. :0.92719 Max. :1.0000 Max. : 0.65500
## global_rate_positive_words global_rate_negative_words rate_positive_words
## Min. :0.00000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.02843 1st Qu.:0.009615 1st Qu.:0.6000
## Median :0.03899 Median :0.015332 Median :0.7108
## Mean :0.03959 Mean :0.016580 Mean :0.6826
## 3rd Qu.:0.05017 3rd Qu.:0.021696 3rd Qu.:0.8000
## Max. :0.15549 Max. :0.162037 Max. :1.0000
## rate_negative_words avg_positive_polarity min_positive_polarity
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.1857 1st Qu.:0.3063 1st Qu.:0.05000
## Median :0.2800 Median :0.3591 Median :0.10000
## Mean :0.2877 Mean :0.3543 Mean :0.09551
## 3rd Qu.:0.3824 3rd Qu.:0.4117 3rd Qu.:0.10000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
## max_positive_polarity avg_negative_polarity min_negative_polarity
## Min. :0.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:0.6000 1st Qu.:-0.3283 1st Qu.:-0.7000
## Median :0.8000 Median :-0.2538 Median :-0.5000
## Mean :0.7572 Mean :-0.2596 Mean :-0.5231
## 3rd Qu.:1.0000 3rd Qu.:-0.1872 3rd Qu.:-0.3000
## Max. :1.0000 Max. : 0.0000 Max. : 0.0000
## max_negative_polarity title_subjectivity title_sentiment_polarity
## Min. :-1.0000 Min. :0.0000 Min. :-1.00000
## 1st Qu.:-0.1250 1st Qu.:0.0000 1st Qu.: 0.00000
## Median :-0.1000 Median :0.1500 Median : 0.00000
## Mean :-0.1072 Mean :0.2832 Mean : 0.07293
## 3rd Qu.:-0.0500 3rd Qu.:0.5000 3rd Qu.: 0.15000
## Max. : 0.0000 Max. :1.0000 Max. : 1.00000
## abs_title_subjectivity abs_title_sentiment_polarity targetVar
## Min. :0.0000 Min. :0.000 0:12943
## 1st Qu.:0.1667 1st Qu.:0.000 1:14808
## Median :0.5000 Median :0.000
## Mean :0.3410 Mean :0.158
## 3rd Qu.:0.5000 3rd Qu.:0.250
## Max. :0.5000 Max. :1.000
cbind(freq=table(y_train), percentage=prop.table(table(y_train))*100)
## freq percentage
## 0 12943 46.63976
## 1 14808 53.36024
sapply(xy_train, function(x) sum(is.na(x)))
## n_tokens_title n_tokens_content
## 0 0
## n_unique_tokens n_non_stop_words
## 0 0
## n_non_stop_unique_tokens num_hrefs
## 0 0
## num_self_hrefs num_imgs
## 0 0
## num_videos average_token_length
## 0 0
## num_keywords data_channel_is_lifestyle
## 0 0
## data_channel_is_entertainment data_channel_is_bus
## 0 0
## data_channel_is_socmed data_channel_is_tech
## 0 0
## data_channel_is_world kw_min_min
## 0 0
## kw_max_min kw_avg_min
## 0 0
## kw_min_max kw_max_max
## 0 0
## kw_avg_max kw_min_avg
## 0 0
## kw_max_avg kw_avg_avg
## 0 0
## self_reference_min_shares self_reference_max_shares
## 0 0
## self_reference_avg_sharess weekday_is_monday
## 0 0
## weekday_is_tuesday weekday_is_wednesday
## 0 0
## weekday_is_thursday weekday_is_friday
## 0 0
## weekday_is_saturday weekday_is_sunday
## 0 0
## is_weekend LDA_00
## 0 0
## LDA_01 LDA_02
## 0 0
## LDA_03 LDA_04
## 0 0
## global_subjectivity global_sentiment_polarity
## 0 0
## global_rate_positive_words global_rate_negative_words
## 0 0
## rate_positive_words rate_negative_words
## 0 0
## avg_positive_polarity min_positive_polarity
## 0 0
## max_positive_polarity avg_negative_polarity
## 0 0
## min_negative_polarity max_negative_polarity
## 0 0
## title_subjectivity title_sentiment_polarity
## 0 0
## abs_title_subjectivity abs_title_sentiment_polarity
## 0 0
## targetVar
## 0
# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
boxplot(x_train[,i], main=names(x_train)[i])
}
# Histograms for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
hist(x_train[,i], main=names(x_train)[i])
}
# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
plot(density(x_train[,i]), main=names(x_train)[i])
}
# Scatterplot matrix colored by class
# pairs(targetVar~., data=xy_train, col=xy_train$targetVar)
# Box and whisker plots for each attribute by class
# scales <- list(x=list(relation="free"), y=list(relation="free"))
# featurePlot(x=x_train, y=y_train, plot="box", scales=scales)
# Density plots for each attribute by class value
# featurePlot(x=x_train, y=y_train, plot="density", scales=scales)
# Correlation plot
correlations <- cor(x_train)
corrplot(correlations, method="circle")
email_notify(paste("Data Summary and Visualization Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@28864e92}"
Some datasets may require additional preparation activities that best expose the structure of the problem and the relationships between the input attributes and the output variable. Such data-prep tasks might include:
# Not applicable for this iteration of the project.
# Mark missing values
#invalid <- 0
#entireDataset$some_col[entireDataset$some_col==invalid] <- NA
# Impute missing values
#entireDataset$some_col <- with(entireDataset, impute(some_col, mean))
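As a reference for future iterations, the mark-then-impute pattern in the comments above can also be done in base R without the impute() helper (which likely comes from the Hmisc package). A minimal sketch on a made-up vector, where 0 is assumed to encode an invalid reading:

```r
# Base-R sketch of mark-then-impute on a hypothetical column
some_col <- c(12, 0, 7, 0, 9)      # suppose 0 encodes an invalid reading
some_col[some_col == 0] <- NA      # mark missing values
some_col[is.na(some_col)] <- mean(some_col, na.rm = TRUE)  # impute with mean
print(some_col)
```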
# Using the Stochastic Gradient Boosting (GBM) algorithm, we try to rank the attributes' importance.
startTimeModule <- proc.time()
set.seed(seedNum)
library(gbm)
## Loaded gbm 2.1.4
model_fs <- train(targetVar~., data=xy_train, method="gbm", preProcess="scale", trControl=control, verbose=FALSE)
rankedImportance <- varImp(model_fs, scale=FALSE)
print(rankedImportance)
## gbm variable importance
##
## only 20 most important variables shown (out of 58)
##
## Overall
## kw_avg_avg 515.87
## is_weekend 332.33
## self_reference_min_shares 296.91
## kw_max_avg 296.83
## data_channel_is_entertainment 261.67
## self_reference_avg_sharess 239.44
## data_channel_is_tech 217.03
## n_unique_tokens 156.52
## kw_min_avg 153.08
## kw_max_max 137.80
## LDA_02 128.57
## data_channel_is_socmed 112.47
## LDA_00 98.47
## kw_avg_max 90.27
## kw_avg_min 86.09
## num_hrefs 68.24
## LDA_01 56.44
## data_channel_is_world 54.66
## global_subjectivity 54.18
## n_non_stop_unique_tokens 52.86
plot(rankedImportance)
# Set the importance threshold and calculate the list of attributes that don't contribute to the importance threshold
maxThreshold <- 0.99
rankedAttributes <- rankedImportance$importance
rankedAttributes <- rankedAttributes[order(-rankedAttributes$Overall),,drop=FALSE]
totalWeight <- sum(rankedAttributes$Overall)
i <- 1
accumWeight <- 0
exit_now <- FALSE
while ((i <= totAttr) && !exit_now) {
accumWeight <- accumWeight + rankedAttributes[i,]
if ((accumWeight/totalWeight) >= maxThreshold) {
exit_now <- TRUE
} else {
i <- i + 1
}
}
lowImportance <- rankedAttributes[(i+1):(totAttr),,drop=FALSE]
lowAttributes <- rownames(lowImportance)
cat('Number of attributes contributed to the importance threshold:',i,"\n")
## Number of attributes contributed to the importance threshold: 42
cat('Number of attributes found to be of low importance:',length(lowAttributes))
## Number of attributes found to be of low importance: 16
# Removing the unselected attributes from the training and validation dataframes
xy_train <- xy_train[, !(names(xy_train) %in% lowAttributes)]
xy_test <- xy_test[, !(names(xy_test) %in% lowAttributes)]
# Not applicable for this iteration of the project.
dim(xy_train)
## [1] 27751 43
sapply(xy_train, class)
## n_tokens_title n_tokens_content
## "numeric" "numeric"
## n_unique_tokens n_non_stop_words
## "numeric" "numeric"
## n_non_stop_unique_tokens num_hrefs
## "numeric" "numeric"
## num_self_hrefs num_imgs
## "numeric" "numeric"
## num_videos average_token_length
## "numeric" "numeric"
## data_channel_is_entertainment data_channel_is_socmed
## "numeric" "numeric"
## data_channel_is_tech data_channel_is_world
## "numeric" "numeric"
## kw_min_min kw_max_min
## "numeric" "numeric"
## kw_avg_min kw_min_max
## "numeric" "numeric"
## kw_max_max kw_avg_max
## "numeric" "numeric"
## kw_min_avg kw_max_avg
## "numeric" "numeric"
## kw_avg_avg self_reference_min_shares
## "numeric" "numeric"
## self_reference_max_shares self_reference_avg_sharess
## "numeric" "numeric"
## weekday_is_friday weekday_is_saturday
## "numeric" "numeric"
## is_weekend LDA_00
## "numeric" "numeric"
## LDA_01 LDA_02
## "numeric" "numeric"
## LDA_03 LDA_04
## "numeric" "numeric"
## global_subjectivity global_rate_positive_words
## "numeric" "numeric"
## rate_positive_words rate_negative_words
## "numeric" "numeric"
## min_positive_polarity avg_negative_polarity
## "numeric" "numeric"
## title_sentiment_polarity abs_title_subjectivity
## "numeric" "numeric"
## targetVar
## "factor"
proc.time()-startTimeScript
## user system elapsed
## 283.601 0.972 290.600
email_notify(paste("Data Cleaning and Transformation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2a18f23c}"
After the data-prep, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the training data. The typical evaluation tasks include:
For this project, we will evaluate one linear, three non-linear, and three ensemble algorithms:
Linear Algorithm: Logistic Regression
Non-Linear Algorithms: Decision Trees (CART), k-Nearest Neighbors, and Support Vector Machine
Ensemble Algorithms: Bagged CART, Random Forest, and Stochastic Gradient Boosting
The random number seed is reset before each run to ensure that each algorithm is evaluated on the same data splits, which makes the results directly comparable.
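The effect of resetting the seed can be seen with a toy example, using base-R sample() as a stand-in for caret's internal resampling:

```r
# Resetting the seed before each run reproduces the same fold assignment
set.seed(888)
foldsA <- sample(rep(1:10, length.out = 100))  # fold assignment, run 1
set.seed(888)
foldsB <- sample(rep(1:10, length.out = 100))  # fold assignment, run 2
identical(foldsA, foldsB)   # TRUE: each algorithm would see the same folds
```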
# Logistic Regression (Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.glm <- train(targetVar~., data=xy_train, method="glm", metric=metricTarget, trControl=control)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
print(fit.glm)
## Generalized Linear Model
##
## 27751 samples
## 42 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24976, 24976, 24977, 24976, 24976, 24975, ...
## Resampling results:
##
## Accuracy Kappa
## 0.655652 0.3058908
proc.time()-startTimeModule
## user system elapsed
## 8.237 0.086 8.420
email_notify(paste("Logistic Regression Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@ea4a92b}"
# Decision Tree - CART (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=xy_train, method="rpart", metric=metricTarget, trControl=control)
print(fit.cart)
## CART
##
## 27751 samples
## 42 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24976, 24976, 24977, 24976, 24976, 24975, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01977903 0.6130594 0.2149499
## 0.03484509 0.6030766 0.1973039
## 0.13636715 0.5678727 0.1069449
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01977903.
proc.time()-startTimeModule
## user system elapsed
## 14.268 0.011 14.434
email_notify(paste("Decision Tree Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@4563e9ab}"
# k-Nearest Neighbors (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.knn <- train(targetVar~., data=xy_train, method="knn", metric=metricTarget, trControl=control)
print(fit.knn)
## k-Nearest Neighbors
##
## 27751 samples
## 42 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24976, 24976, 24977, 24976, 24976, 24975, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.5713668 0.1390589
## 7 0.5741413 0.1440273
## 9 0.5741413 0.1434159
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
proc.time()-startTimeModule
## user system elapsed
## 119.877 0.050 121.168
email_notify(paste("k-Nearest Neighbors Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@c818063}"
# Support Vector Machine (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.svm <- train(targetVar~., data=xy_train, method="svmRadial", metric=metricTarget, trControl=control)
print(fit.svm)
## Support Vector Machines with Radial Basis Function Kernel
##
## 27751 samples
## 42 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24976, 24976, 24977, 24976, 24976, 24975, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.6586435 0.3096639
## 0.50 0.6618143 0.3166711
## 1.00 0.6648769 0.3236104
##
## Tuning parameter 'sigma' was held constant at a value of 0.01866281
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01866281 and C = 1.
proc.time()-startTimeModule
## user system elapsed
## 3067.300 77.111 3180.436
email_notify(paste("Support Vector Machine Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@129a8472}"
In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.
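The `train()` calls throughout this section rely on objects (`control`, `metricTarget`, `seedNum`) defined in earlier chunks of the script. For reference, the sketch below reconstructs what the printed output implies (10-fold cross-validation repeated once, scored on Accuracy); the seed value shown is an assumption, not necessarily the one the script uses.

```r
# Assumed resampling setup, inferred from the printed train() output:
# "Cross-Validated (10 fold, repeated 1 times)" with metric = Accuracy.
library(caret)
seedNum <- 888          # placeholder seed; the script's actual value may differ
metricTarget <- "Accuracy"
control <- trainControl(method="repeatedcv", number=10, repeats=1)
```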
# Bagged CART (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART
##
## 27751 samples
## 42 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24976, 24976, 24977, 24976, 24976, 24975, ...
## Resampling results:
##
## Accuracy Kappa
## 0.6488416 0.2923625
proc.time()-startTimeModule
## user system elapsed
## 297.788 0.530 301.401
email_notify(paste("Bagged CART Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@7c30a502}"
# Random Forest (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest
##
## 27751 samples
## 42 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24976, 24976, 24977, 24976, 24976, 24975, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.6741380 0.3408538
## 22 0.6704982 0.3348646
## 42 0.6682283 0.3307585
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
## user system elapsed
## 3181.550 17.184 3230.688
email_notify(paste("Random Forest Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@b684286}"
# Stochastic Gradient Boosting (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.gbm <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=F)
print(fit.gbm)
## Stochastic Gradient Boosting
##
## 27751 samples
## 42 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24976, 24976, 24977, 24976, 24976, 24975, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.6511111 0.2917341
## 1 100 0.6592191 0.3101098
## 1 150 0.6627148 0.3180982
## 2 50 0.6590028 0.3092768
## 2 100 0.6645524 0.3220146
## 2 150 0.6684799 0.3303902
## 3 50 0.6624622 0.3168486
## 3 100 0.6677594 0.3287643
## 3 150 0.6698135 0.3333561
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
proc.time()-startTimeModule
## user system elapsed
## 180.175 0.306 182.334
email_notify(paste("Stochastic Gradient Boosting Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@17c68925}"
### 4.d) Compare baseline algorithms
results <- resamples(list(LR=fit.glm, CART=fit.cart, kNN=fit.knn, SVM=fit.svm, BagCART=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: LR, CART, kNN, SVM, BagCART, RF, GBM
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LR 0.6441961 0.6479279 0.6575394 0.6556520 0.6590090 0.6735135 0
## CART 0.6018018 0.6062514 0.6128626 0.6130594 0.6164865 0.6327928 0
## kNN 0.5625225 0.5654567 0.5756757 0.5741413 0.5826126 0.5834234 0
## SVM 0.6418018 0.6570242 0.6681081 0.6648769 0.6712911 0.6879279 0
## BagCART 0.6313514 0.6465183 0.6469105 0.6488416 0.6499682 0.6663063 0
## RF 0.6601802 0.6672372 0.6703299 0.6741380 0.6801802 0.6944144 0
## GBM 0.6547748 0.6618031 0.6719509 0.6698135 0.6773874 0.6828829 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LR 0.2813747 0.2908032 0.3093688 0.3058908 0.3131719 0.3424442 0
## CART 0.1882878 0.2040335 0.2113768 0.2149499 0.2270599 0.2535311 0
## kNN 0.1199373 0.1257108 0.1476646 0.1440273 0.1604046 0.1633571 0
## SVM 0.2760263 0.3068133 0.3306548 0.3236104 0.3361399 0.3704285 0
## BagCART 0.2557445 0.2876354 0.2886970 0.2923625 0.2944650 0.3275340 0
## RF 0.3124632 0.3262607 0.3341703 0.3408538 0.3535053 0.3821219 0
## GBM 0.3032725 0.3179184 0.3377054 0.3333561 0.3491367 0.3605639 0
dotplot(results)
cat('The average accuracy from all models is:',
mean(c(results$values$`LR~Accuracy`,results$values$`CART~Accuracy`,results$values$`kNN~Accuracy`,results$values$`SVM~Accuracy`,results$values$`BagCART~Accuracy`,results$values$`RF~Accuracy`,results$values$`GBM~Accuracy`)))
## The average accuracy from all models is: 0.6429318
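A more compact way to compute the same average, rather than naming every `<model>~Accuracy` column by hand, is to select the columns by pattern from the `resamples` values data frame. This is a sketch against the same `results` object as above:

```r
# Average resampled accuracy across all models in `results`,
# selecting every "<model>~Accuracy" column programmatically.
acc_cols <- grep("~Accuracy$", names(results$values), value=TRUE)
mean(as.matrix(results$values[, acc_cols]))
```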
After we arrive at a short list of machine learning algorithms with a good level of accuracy, we can look for ways to improve that accuracy further.
Using the two best-performing algorithms from the previous section, we will search for the combination of tuning parameters for each algorithm that yields the best results.
Finally, we will tune the best-performing algorithms further and see whether we can get more accuracy out of them.
# Tuning algorithm #1 - Random Forest
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(mtry=c(2,3,4,5))
fit.final1 <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final1)
print(fit.final1)
## Random Forest
##
## 27751 samples
## 42 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24976, 24976, 24977, 24976, 24976, 24975, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.6731648 0.3388394
## 3 0.6748581 0.3426844
## 4 0.6742459 0.3418766
## 5 0.6720836 0.3375595
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
proc.time()-startTimeModule
## user system elapsed
## 2991.676 19.916 3041.369
email_notify(paste("Algorithm #1 Tuning Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@96532d6}"
# Tuning algorithm #2 - Stochastic Gradient Boosting
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(.n.trees=c(300,500,700,900), .shrinkage=0.1, .interaction.depth=c(2,3), .n.minobsinnode=10)
fit.final2 <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, tuneGrid=grid, trControl=control, verbose=F)
plot(fit.final2)
print(fit.final2)
## Stochastic Gradient Boosting
##
## 27751 samples
## 42 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 24976, 24976, 24977, 24976, 24976, 24975, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 2 300 0.6718675 0.3377429
## 2 500 0.6728763 0.3401466
## 2 700 0.6748220 0.3441818
## 2 900 0.6746781 0.3439168
## 3 300 0.6726242 0.3397446
## 3 500 0.6751105 0.3446334
## 3 700 0.6746782 0.3438693
## 3 900 0.6741737 0.3429761
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 500,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
proc.time()-startTimeModule
## user system elapsed
## 807.642 0.102 815.714
email_notify(paste("Algorithm #2 Tuning Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1554909b}"
results <- resamples(list(RF=fit.final1, GBM=fit.final2))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: RF, GBM
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.6576577 0.6663359 0.6731532 0.6748581 0.6770241 0.7005405 0
## GBM 0.6590991 0.6719200 0.6749550 0.6751105 0.6811712 0.6915315 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.3071212 0.3253198 0.3396705 0.3426844 0.3472923 0.3954288 0
## GBM 0.3115274 0.3389015 0.3444874 0.3446334 0.3575708 0.3787939 0
dotplot(results)
Once we have narrowed down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. Finalizing a model may involve sub-tasks such as making predictions on the validation dataset, creating a standalone model trained on the entire dataset, and saving the model for later use.
predictions <- predict(fit.final2, newdata=xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3380 1814
## 1 2167 4532
##
## Accuracy : 0.6653
## 95% CI : (0.6567, 0.6737)
## No Information Rate : 0.5336
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3248
## Mcnemar's Test P-Value : 2.421e-08
##
## Sensitivity : 0.6093
## Specificity : 0.7142
## Pos Pred Value : 0.6508
## Neg Pred Value : 0.6765
## Prevalence : 0.4664
## Detection Rate : 0.2842
## Detection Prevalence : 0.4367
## Balanced Accuracy : 0.6617
##
## 'Positive' Class : 0
##
pred <- prediction(as.numeric(predictions), as.numeric(y_test))
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, colorize=TRUE)
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.6617445
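Note that the ROC curve above is built from hard class labels, so it has a single operating point and its AUC coincides with the balanced accuracy (0.6617). A fuller curve requires class probabilities. The sketch below is hypothetical: caret's `predict(..., type="prob")` only works if the model was trained with `trainControl(..., classProbs=TRUE)` and with factor levels that are valid R names (e.g. "No"/"Yes" rather than this script's '0'/'1').

```r
# Hypothetical probability-based ROC/AUC (assumes a refit with
# classProbs=TRUE and renamed class levels, e.g. "No"/"Yes").
probs <- predict(fit.final2, newdata=xy_test, type="prob")
pred_p <- prediction(probs[, 2], as.numeric(y_test))
perf_p <- performance(pred_p, measure="tpr", x.measure="fpr")
plot(perf_p, colorize=TRUE)
performance(pred_p, measure="auc")@y.values[[1]]
```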
startTimeModule <- proc.time()
library(gbm)
set.seed(seedNum)
# Combining the training and test datasets to form the original dataset that will be used for training the final model
xy_train <- rbind(xy_train, xy_test)
#finalModel <- gbm(targetVar ~ ., data = xy_train, n.trees=700, verbose=F)
grid <- expand.grid(.n.trees=500, .shrinkage=0.1, .interaction.depth=3, .n.minobsinnode=10)
finalModel <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, tuneGrid=grid, trControl=control, verbose=F)
print(finalModel)
## Stochastic Gradient Boosting
##
## 39644 samples
## 42 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 35679, 35679, 35680, 35680, 35680, 35680, ...
## Resampling results:
##
## Accuracy Kappa
## 0.6728129 0.3396459
##
## Tuning parameter 'n.trees' was held constant at a value of 500
##
## Tuning parameter 'interaction.depth' was held constant at a value of 3
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
proc.time()-startTimeModule
## user system elapsed
## 418.730 0.028 422.976
email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@668bc3d5}"
#saveRDS(finalModel, "./finalModel_BinaryClass.rds")
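If the `saveRDS()` line above is uncommented, the serialized model can be reloaded later for scoring. A minimal sketch, assuming the new observations carry the same 42 predictors used in training (`new_data` is a hypothetical data frame):

```r
# Reload the finalized caret model and score fresh observations.
model <- readRDS("./finalModel_BinaryClass.rds")
newPredictions <- predict(model, newdata=new_data)
```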
proc.time()-startTimeScript
## user system elapsed
## 11377.265 116.496 11635.195